Abstract

We prove that any submodular function $f : \{0,1\}^n \to \{0,1,\dots,k\}$ can be represented as a pseudo-Boolean $2k$-DNF formula. Pseudo-Boolean DNFs are a natural generalization of the DNF representation to functions with integer range. Each term in such a formula has an associated integral constant. We show that an analog of Håstad's switching lemma holds for pseudo-Boolean $k$-DNFs if all constants associated with the terms of the formula are bounded.

This allows us to generalize Mansour's PAC-learning algorithm for $k$-DNFs to pseudo-Boolean $k$-DNFs, and hence gives a PAC-learning algorithm with membership queries under the uniform distribution for submodular functions of the form $f : \{0,1\}^n \to \{0,1,\dots,k\}$. Our algorithm runs in time polynomial in $n$, $k^{O(k \log (k/\epsilon))}$, $1/\epsilon$ and $\log(1/\delta)$ and works even in the agnostic setting. The line of previous work on learning submodular functions [Balcan, Harvey (STOC '11), Gupta, Hardt, Roth, Ullman (STOC '11), Cheraghchi, Klivans, Kothari, Lee (SODA '12)] implies only $n^{O(k)}$ query complexity for learning submodular functions in this setting, for fixed $\epsilon$ and $\delta$.

Our learning algorithm implies a property tester for submodularity of functions $f : \{0,1\}^n \to \{0,\dots,k\}$ with query complexity polynomial in $n$ for $k = O((\log n/\log\log n)^{1/2})$ and constant proximity parameter $\epsilon$.



*This material is based upon work supported by NSF CAREER award CCF-0845701. 
Pennsylvania State University, USA. {sofya, grigory}@cse.psu.edu.



1 Introduction 



We investigate learning of submodular set functions, defined on the ground set $[n] = \{1,\dots,n\}$. A set function $f : 2^{[n]} \to \mathbb{R}$ is submodular if one of the following equivalent definitions holds:

1. $f(S) + f(T) \ge f(S \cup T) + f(S \cap T)$ for all $S, T \subseteq [n]$.

2. $f(S \cup \{i\}) - f(S) \ge f(T \cup \{i\}) - f(T)$ for all $S \subseteq T \subseteq [n]$ and $i \in [n] \setminus T$.

3. $f(S \cup \{i\}) + f(S \cup \{j\}) \ge f(S \cup \{i,j\}) + f(S)$ for all $S \subseteq [n]$ and $i, j \in [n] \setminus S$.
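For small ground sets, the third definition can be checked exhaustively. The following is a minimal sketch, not part of the original development: it assumes the ground set is indexed from 0, represents sets as frozensets, and checks the inequality over distinct pairs $i, j$; the helper names and the example functions are hypothetical.

```python
from itertools import combinations

def subsets(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for size in range(n + 1):
        for combo in combinations(range(n), size):
            yield frozenset(combo)

def is_submodular(f, n):
    """Brute-force check of definition 3 over all S and distinct i, j outside S."""
    for S in subsets(n):
        outside = [e for e in range(n) if e not in S]
        for i, j in combinations(outside, 2):
            if f(S | {i}) + f(S | {j}) < f(S | {i, j}) + f(S):
                return False
    return True

# Hypothetical example: a coverage function over three sets A_0, A_1, A_2.
A = [{0, 1}, {1, 2}, {2}]

def coverage(S):
    """Size of the union of the sets indexed by S."""
    return len(set().union(*(A[j] for j in S)))

def threshold(S):
    """1 if |S| >= 2 and 0 otherwise (not submodular)."""
    return 1 if len(S) >= 2 else 0
```

The check takes time exponential in $n$, so it is only meant as a sanity test on toy instances.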

Submodular set functions are important and widely studied, with applications in combinatorial optimization, economics, algorithmic game theory and many other disciplines. In many contexts, submodular functions are integral and nonnegative, and this is the setting we focus on. Examples of such functions include coverage functions$^1$, matroid rank functions, functions modeling valuations when the value of each set is expressed in dollars, cut functions of graphs$^2$, and cardinality-based set functions, i.e., functions of the form $f(S) = g(|S|)$, where $g$ is concave.

We study submodular functions $f : 2^{[n]} \to \{0,1,\dots,k\}$, and give a learning algorithm for this class. To obtain our result, we use tools from several diverse areas, ranging from operations research to complexity theory.

Structural result. The first ingredient in the design of our algorithm is a structural result which shows that 
every submodular function in this class can be represented by a narrow pseudo-Boolean disjunctive normal 
form (DNF) formula, which naturally generalizes DNF for pseudo-Boolean functions. Pseudo-Boolean 
DNF formulas are well studied. (For an introduction to pseudo-Boolean functions and normal forms, see 
§13 of the book by Crama and Hammer [CH11].) 

In the next definition and the rest of the paper, we use domains $2^{[n]}$ and $\{0,1\}^n$ interchangeably. They are equivalent because there is a bijection between sets $S \subseteq [n]$ and strings $x_1 \dots x_n \in \{0,1\}^n$, where the bit $x_i$ is mapped to 1 if $i \in S$ and to 0 otherwise.

Definition 1.1 (Pseudo-Boolean DNF). Let $x_1,\dots,x_n$ be variables taking values in $\{0,1\}$. A pseudo-Boolean DNF of width $k$ and size $s$ (also called a $k$-DNF of size $s$) is an expression of the form

$$f(x_1,\dots,x_n) = \max_{t=1}^{s} \Big( a_t \Big(\bigwedge_{i \in A_t} x_i\Big) \Big(\bigwedge_{j \in B_t} \bar{x}_j\Big) \Big),$$

where $a_t$ are constants, $A_t, B_t \subseteq [n]$ and $|A_t| + |B_t| \le k$ for $t \in [s]$. A pseudo-Boolean DNF is monotone if it contains no negated variables, i.e., $B_t = \emptyset$ for all terms in the max expression. The class of all functions that have pseudo-Boolean $k$-DNF representations with constants $a_t \in \{0,\dots,r\}$ is denoted $\mathrm{DNF}^{k,r}$.
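To make the definition concrete, the sketch below evaluates a pseudo-Boolean DNF on a 0/1 input. The representation of a term as a triple `(a, A, B)` with 0-indexed variables is an assumption made for illustration, and the default value 0 for an input that satisfies no term relies on the constants being nonnegative.

```python
def eval_pb_dnf(terms, x):
    """Evaluate max_t a_t * AND_{i in A_t} x_i * AND_{j in B_t} (1 - x_j).

    terms: list of (a, A, B), where A holds positive and B negated variable indices.
    x: tuple of 0/1 values. Unsatisfied terms contribute 0.
    """
    best = 0
    for a, A, B in terms:
        satisfied = all(x[i] == 1 for i in A) and all(x[j] == 0 for j in B)
        best = max(best, a if satisfied else 0)
    return best

def width(terms):
    """Largest number of literals |A_t| + |B_t| in any term."""
    return max(len(A) + len(B) for _, A, B in terms)

# Hypothetical 2-DNF on three variables: max(2 * x0 x1, 1 * x2 (not x0)).
terms = [(2, {0, 1}, set()), (1, {2}, {0})]
```

For example, the input `(1, 1, 0)` satisfies only the first term, while `(1, 0, 1)` satisfies neither.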

It is not hard to see that every set function $f : 2^{[n]} \to \{0,\dots,k\}$ has a pseudo-Boolean DNF representation with constants $a_t \in \{0,\dots,k\}$, but in general there is no bound on the width of the formula.

Our structural result, stated next, shows that every submodular function $f : 2^{[n]} \to \{0,\dots,k\}$ can be represented by a pseudo-Boolean $2k$-DNF with constants $a_t \in \{0,\dots,k\}$. Our result is stronger for monotone functions, i.e., functions satisfying $f(S) \le f(T)$ for all $S \subseteq T \subseteq [n]$. Examples of monotone submodular functions include coverage functions and matroid rank functions.

$^1$Given sets $A_1,\dots,A_n$ in the universe $U$, a coverage function is $f(S) = |\bigcup_{j \in S} A_j|$.

$^2$Given a graph $G$ on the vertex set $[n]$, the cut function $f(S)$ of $G$ is the number of edges of $G$ crossing the cut $(S, [n] \setminus S)$.
Theorem 1.1 (DNF representation of submodular functions). Each submodular function $f : \{0,1\}^n \to \{0,\dots,k\}$ can be represented by a pseudo-Boolean $2k$-DNF with constants $a_t \in \{0,\dots,k\}$ for all $t \in [s]$. Moreover, each term of the pseudo-Boolean DNF has at most $k$ positive and at most $k$ negated variables, i.e., $|A_t| \le k$ and $|B_t| \le k$. If $f$ is monotone then its representation is a monotone pseudo-Boolean $k$-DNF.

Note that the converse of Theorem 1.1 is false. E.g., consider the function $f(S)$ that maps $S$ to 0 if $|S| \le 1$ and to 1 otherwise. It can be represented by a 2-DNF as follows: $f(x_1 \dots x_n) = \max_{i,j \in [n],\, i \ne j} x_i \wedge x_j$. However, it is not submodular, since version 3 of the definition above is falsified with $S = \emptyset$, $i = 1$ and $j = 2$.
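Written out, the violated inequality for this choice of $S$, $i$ and $j$ reads:

```latex
f(\{1\}) + f(\{2\}) \;=\; 0 + 0 \;=\; 0 \;<\; 1 \;=\; 1 + 0 \;=\; f(\{1,2\}) + f(\emptyset).
```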

Our proof of Theorem 1.1 builds on techniques developed by Gupta et al. [GHRU11] who show how to decompose a given submodular function into Lipschitz submodular functions. We first prove our structural result for monotone submodular functions. We use the decomposition from [GHRU11] to cover the domain of such a function by regions where the function is constant and then capture each region by a monotone term of width at most $k$. Then we decompose a general submodular function $f$ into monotone regions, as in [GHRU11]. For each such region, we construct a monotone function which coincides with $f$ on that region, does not exceed $f$ everywhere else, and can be represented as a narrow pseudo-Boolean $k$-DNF by invoking our structural result for monotone submodular functions. This construction uses a monotone extension of submodular functions defined by Lovász [Lov82].

Learning. Our main result is a PAC-learning algorithm with membership queries under the uniform distribution for pseudo-Boolean $k$-DNF, which by Theorem 1.1 also applies to submodular functions $f : 2^{[n]} \to \{0,\dots,k\}$. We use a (standard) variant of the PAC-learning definition given by Valiant [Val84].

Definition 1.2 (PAC and agnostic learning under uniform distribution). Let $U_n$ be the uniform distribution on $\{0,1\}^n$. A class of functions $\mathcal{C}$ is PAC-learnable under the uniform distribution if there exists a randomized algorithm $\mathcal{A}$, called a PAC-learner, which for every function $f \in \mathcal{C}$ and every $\epsilon, \delta > 0$, with probability at least $1 - \delta$ over the randomness of $\mathcal{A}$, outputs a hypothesis $h$, such that

$$\Pr_{x \sim U_n}[h(x) \ne f(x)] \le \epsilon. \qquad (1)$$

A learning algorithm $\mathcal{A}$ is proper if it always outputs a hypothesis $h$ from the class $\mathcal{C}$. A learning algorithm is agnostic if it works even if the input function $f$ is arbitrary (not necessarily from $\mathcal{C}$), with $\epsilon$ replaced by $opt + \epsilon$ in (1), where $opt$ is the smallest achievable error for a hypothesis in $\mathcal{C}$.

Our algorithm accesses its input $f$ via membership queries, i.e., by requesting $f(x)$ on some $x$ in $f$'s domain.

Theorem 1.2. The class of pseudo-Boolean $k$-DNF formulas on $n$ variables with constants in the range $\{0,\dots,r\}$ is PAC-learnable with membership queries under the uniform distribution with running time polynomial in $n$, $k^{O(k \log (r/\epsilon))}$, $1/\epsilon$ and $\log(1/\delta)$, even in the agnostic setting.

Our (non-agnostic) learning algorithm is a generalization of Mansour's PAC-learner for $k$-DNF [Man95]. It consists of running the algorithm of Kushilevitz and Mansour [KM91] for learning functions that can be approximated by functions with few non-zero Fourier coefficients, and thus has the same running time (and the same low-degree polynomial dependence on $n$). To be able to use this algorithm, we prove (in Lemma 4.1) that all functions in $\mathrm{DNF}^{k,r}$ have this property. The agnostic version of our algorithm follows from the Fourier concentration result in Lemma 4.1 and the work of Gopalan, Kalai and Klivans [GKK08].

The key ingredient in the proof of Lemma 4.1 (on Fourier concentration) is a generalization of Håstad's switching lemma [Has86, Bea94] for standard DNF formulas to pseudo-Boolean DNF. Our generalization (formally stated in Lemma 3.1) asserts that a function $f \in \mathrm{DNF}^{k,r}$, restricted on a large random subset of variables to random Boolean values, with large probability can be computed by a decision tree of small depth. (See Section 3 for definitions of random restrictions and decision trees.) Crucially, our bound on the probability that a random restriction of $f$ has large decision-tree complexity is only a factor of $r$ larger than the corresponding guarantee for the Boolean case.

Theorems 1.2 and 1.1 imply the following corollary. 

Corollary 1.3. The class of submodular functions $f : \{0,1\}^n \to \{0,\dots,k\}$ is PAC-learnable with membership queries under the uniform distribution in time polynomial in $n$, $k^{O(k \log (k/\epsilon))}$ and $\log(1/\delta)$.

Implications for testing submodularity. Our results give property testers for submodularity of functions $f : 2^{[n]} \to \{0,\dots,k\}$. A property tester [RS96, GGR98] is given oracle access to an object and a proximity parameter $\epsilon \in (0,1)$. If the object has the desired property, the tester accepts it with probability at least 2/3; if the object is $\epsilon$-far from having the desired property then the tester rejects it with probability at least 2/3. Specifically, for properties of functions, $\epsilon$-far means that a given function differs on at least an $\epsilon$ fraction of the domain points from any function with the property.

As we observe in Proposition A.1, a learner for a discrete class (e.g., the class of functions $f : 2^{[n]} \to \{0,\dots,k\}$) can be converted to a proper learner with the same query complexity (but huge overhead in running time). Thus, Corollary 1.3 implies a tester for submodularity of functions $f : 2^{[n]} \to \{0,\dots,k\}$ with query complexity polynomial in $n$ and $k^{O(k \log (k/\epsilon))}$, making progress on a question posed by Seshadhri [Ses11].

1.1 Related work 

Structural results for Boolean submodular functions. For the special case of Boolean functions, char- 
acterizations of submodular and monotone submodular functions in terms of simple DNF formulas are 
known. A Boolean function is monotone submodular if and only if it can be represented as a monotone 
1-DNF (see, e.g., Appendix A in [BH11]). A Boolean function is submodular if and only if it has a pure 
(without singleton terms) 2-DNF representation [EHP97]. 

Learning submodular functions. The problem of learning submodular functions has recently attracted significant interest. The focus on learning-style guarantees, which allow one to make arbitrary mistakes on some small portion of the domain, is justified by a negative result of Goemans et al. [GHIM09]. It demonstrates that every algorithm that makes a polynomial in $n$ number of queries to a monotone submodular function (more specifically, even a matroid rank function) and tries to approximate it on all points in the domain, must make an $\Omega(\sqrt{n}/\log n)$ multiplicative error on some point.

Using results on concentration of Lipschitz submodular functions [BLM00, BLM09, Von10] and on noise-stability of submodular functions [CKKL12], significant progress on learning submodular functions was obtained by Balcan and Harvey [BH11, BH10], Gupta et al. [GHRU11] and Cheraghchi et al. [CKKL12]. These works obtain learners that approximate submodular functions, as opposed to learning them exactly, on an $\epsilon$ fraction of values in the domain. However, their learning algorithms generally work with weaker access models and with submodular functions over more general ranges.

Balcan and Harvey's algorithms learn a function within a given multiplicative error on all but an $\epsilon$ fraction of the probability mass (according to a specified distribution on the domain). Their first algorithm learns monotone nonnegative submodular functions over $2^{[n]}$ within a multiplicative factor of $\sqrt{n}$ over arbitrary distributions using only random examples in polynomial time. For the special case of product distributions and monotone nonnegative submodular functions with Lipschitz constant 1, their second algorithm can learn within a constant factor in polynomial time.

Gupta et al. [GHRU11] design an algorithm that learns a submodular function with the range $[0,1]$ within a given additive error $\alpha$ on all but an $\epsilon$ fraction of the probability mass (according to a specified product distribution on the domain). Their algorithm requires membership queries, but works even when these queries are answered with additive error $\alpha/4$. It takes $n^{O(\log(1/\epsilon)/\alpha^2)}$ time.

Cheraghchi et al. [CKKL12] also work with additive error. Their learner is agnostic and only uses statistical queries. It produces a hypothesis which (with probability at least $1 - \delta$) has the expected additive error $opt + \alpha$ with respect to a product distribution, where $opt$ is the error of the best concept in the class. Their algorithm runs in time polynomial in $n^{O(1/\alpha^2)}$ and $\log(1/\delta)$.

Observe that the results in [GHRU11] and [CKKL12] directly imply an $n^{O(\log(1/\epsilon) k^2)}$ time algorithm for our setting, by rescaling our input function to be in $[0,1]$ and setting the error $\alpha = 1/(2r)$. The techniques in [GHRU11] also imply $n^{O(k)}$ time complexity for non-agnostically learning submodular functions in this setting, for fixed $\epsilon$ and $\delta$. To the best of our knowledge, this is the best dependence on $n$ one can obtain from previous work.

Finally, Chakrabarty and Huang [CH12] gave an exact learning algorithm for coverage functions, a subclass of monotone submodular functions. Their algorithm makes $O(n|U|)$ queries, where $|U|$ is the size of the universe. (Coverage functions are defined as in Footnote 1 with an additional nonnegative weight for each set, and $f(S)$ equal to the weight of $\bigcup_{j \in S} A_j$ instead of the cardinality.)

Property testing submodular functions. The study of submodularity in the context of property testing was initiated by Parnas, Ron and Rubinfeld [PRR03]. Seshadhri and Vondrak [SV11] gave the first sublinear (in the size of the domain) tester for submodularity of set functions. Their tester works for all ranges and has query and time complexity $(1/\epsilon)^{O(\sqrt{n} \log n)}$. They also showed a reduction from testing monotonicity to testing submodularity which, together with a lower bound for testing monotonicity given by Blais, Brody and Matulef [BBM11], implies a lower bound of $\Omega(n)$ on the query complexity of testing submodularity for an arbitrary range and constant $\epsilon > 0$.

Given the large gap between known upper and lower bounds on the complexity of testing submodularity, 
Seshadhri [Ses11] asked for testers for several important subclasses of submodular functions. The exact
learner of Chakrabarty and Huang [CH12] for coverage functions, mentioned above, gives a property tester 
for this class with the same query complexity. 

For the special case of Boolean functions, in the light of the structural results mentioned above, one can test if a function is monotone submodular with $O(1/\epsilon)$ queries by using the algorithm from [PRS02] (Section 4.3) for testing monotone monomials.

2 Representing submodular functions as pseudo-Boolean DNFs 

In this section, we prove Theorem 1.1 that shows that every submodular function over a bounded (nonnegative) integral range can be represented by a narrow pseudo-Boolean DNF. After introducing notation used in the rest of the section (in Definition 2.1), we prove the theorem for the special case when $f$ is monotone submodular (restated in Lemma 2.1) and then present the proof for the general case. In the proof, we give a recursive algorithm for constructing a pseudo-Boolean DNF representation which has the same structure of recursive calls as the decomposition algorithm of Gupta et al. [GHRU11]. Our contribution is in showing how to use these calls to get a pseudo-Boolean $2k$-DNF representation of the input function.

Definition 2.1 ($S^{\downarrow}$ and $S^{\uparrow}$). For a set $S \in 2^{[n]}$, we denote the collection of all subsets of $S$ by $S^{\downarrow}$ and the collection of all supersets of $S$ by $S^{\uparrow}$.

Lemma 2.1 (DNF representation of monotone submodular functions). Each monotone submodular function $f : \{0,1\}^n \to \{0,\dots,k\}$ can be represented by a pseudo-Boolean monotone $k$-DNF with constants $a_t \in \{0,\dots,k\}$ for all $t \in [s]$.
Proof. Algorithm 1 below, with the initial call MONOTONE-DNF($f$, $\emptyset$), returns the collection of terms in a pseudo-Boolean DNF representation of $f$.



Algorithm 1: MONOTONE-DNF($f$, $S$).

input : Oracle access to $f : 2^{[n]} \to \{0,\dots,k\}$, argument $S \in 2^{[n]}$.
output: Collection $C$ of monotone terms of width at most $k$.

1 $C \leftarrow \{ f(S) \cdot (\bigwedge_{i \in S} x_i) \}$
2 for $j \in [n] \setminus S$ do
3     if $f(S \cup \{j\}) > f(S)$ then
4         $C \leftarrow C \cup$ MONOTONE-DNF($f$, $S \cup \{j\}$).
5 return $C$



First, note that the invariant $f(S) \ge |S|$ is maintained for every call MONOTONE-DNF($f$, $S$). Since the maximum value of $f$ is at most $k$, there are no calls with $|S| > k$. Thus, every term in the collection returned by MONOTONE-DNF($f$, $\emptyset$) has width at most $k$. By definition, all terms are monotone.

Next, we show that the resulting formula $\max_{C_i \in C} C_i$ exactly represents $f$. For all $Y \in 2^{[n]}$ we have $f(Y) \ge \max_{C_i \in C} C_i(Y)$ by monotonicity of $f$. To see that $f(Y) \le \max_{C_i \in C} C_i(Y)$, let $\mathcal{T} = \{Z \mid Z \subseteq Y, f(Z) = f(Y)\}$ and $T$ be a set of the smallest size in $\mathcal{T}$. If there was a recursive call MONOTONE-DNF($f$, $T$) then the term added by this recursive call would ensure the inequality. If $T = \emptyset$ then such a call was made. Otherwise, consider the set $\mathcal{U} = \{T \setminus \{j\} \mid j \in T\}$. By the choice of $T$, we have $f(Z) < f(T)$ for all $Z \in \mathcal{U}$. By submodularity of $f$, this implies that the restriction of $f$ on $T^{\downarrow}$ is a strictly increasing function. Thus, the recursive call MONOTONE-DNF($f$, $T$) was made and the term added by it guarantees the inequality. □
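The recursion of Algorithm 1 can be sketched in a few lines. This is an illustrative rendering under our own assumptions, not a tuned implementation: the ground set is indexed from 0, a term is stored as a pair (constant, set of positive variables), and the example function `rank2` (the rank function of a uniform matroid, truncated at 2) is hypothetical.

```python
from itertools import combinations

def monotone_dnf(f, n, S=frozenset()):
    """Collect the terms f(S) * AND_{i in S} x_i along the recursion of Algorithm 1."""
    terms = {(f(S), S)}
    for j in range(n):
        # Recurse only on elements that strictly increase the value.
        if j not in S and f(S | {j}) > f(S):
            terms |= monotone_dnf(f, n, S | {j})
    return terms

def eval_monotone(terms, Y):
    """Value of the monotone pseudo-Boolean DNF on the set Y."""
    return max(a for a, A in terms if A <= Y)

def rank2(S):
    """Hypothetical monotone submodular example with range {0, 1, 2}."""
    return min(len(S), 2)

terms = monotone_dnf(rank2, 4)
```

On this example every term has at most 2 variables, matching the width bound of $k = 2$, and the formula agrees with `rank2` on all 16 subsets. The sketch revisits the same set along different recursion paths, which is harmless for correctness but wasteful.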

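For a collection $\mathcal{S}$ of subsets of $[n]$, let $f_{\mathcal{S}} : \mathcal{S} \to \mathbb{R}$ denote the restriction of a function $f$ to the union of sets in $\mathcal{S}$. We use notation $1_{\mathcal{S}} : 2^{[n]} \to \{0,1\}$ for the indicator function defined by $1_{\mathcal{S}}(Y) = 1$ iff $Y \in \bigcup_{S \in \mathcal{S}} S$.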

Proof of Theorem 1.1. For a general submodular function, the formula can be constructed using Algorithm 2, with the initial call DNF($f$, $[n]$). The algorithm description uses the function $f_S^{mon}$, defined next.

Definition 2.2 (Function $f_S^{mon}$). For a set $S \subseteq [n]$, define the function $f_S^{mon} : S^{\downarrow} \to \{0,\dots,k\}$ as follows: $f_S^{mon}(Y) = \min_{Y \subseteq Z \subseteq S} f(Z)$.

Proposition 2.2 (Proposition 2.1 in [Lov82]). If $f_{S^{\downarrow}}$ is a submodular function, then $f_S^{mon}$ is a monotone submodular function.

Let $\mathcal{S}$ be the collection of sets $S \subseteq [n]$ for which a recursive call is made when DNF($f$, $[n]$) is executed. For a set $S \in \mathcal{S}$, let $B(S) = \{j \mid f(S \setminus \{j\}) \le f(S)\}$ be the set consisting of elements such that if we remove them from $S$, the value of the function does not increase. Let the monotone region of $S$ be defined by $S^{\updownarrow} = \{Z \mid (S \setminus B(S)) \subseteq Z \subseteq S\} = S^{\downarrow} \cap (S \setminus B(S))^{\uparrow}$. By submodularity of $f$, the restriction $f_{S^{\updownarrow}}$ is a monotone nondecreasing function.

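Proposition 2.3. Fix $S \in \mathcal{S}$. Then $f(Y) \ge f_S^{mon}(Y)$ for all $Y \in S^{\downarrow}$. Moreover, $f(Y) = f_S^{mon}(Y)$ for all $Y \in S^{\updownarrow}$.

Proof. By the definition of $f_S^{mon}$, we have $f_S^{mon}(Y) = \min_{Y \subseteq Z \subseteq S} f(Z) \le f(Y)$ for all $Y \in S^{\downarrow}$. Since the restriction $f_{S^{\updownarrow}}$ is monotone nondecreasing, $f_S^{mon}(Y) = \min_{Y \subseteq Z \subseteq S} f(Z) = f(Y)$ for all $Y \in S^{\updownarrow}$. □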

The following proposition is implicit in [GHRU11]. For completeness, we prove it in Appendix C. 
Algorithm 2: DNF($f$, $S$).

input : Oracle access to $f : 2^{[n]} \to \{0,\dots,k\}$, argument $S \in 2^{[n]}$.
output: Collection $C$ of terms, each containing at most $k$ positive and at most $k$ negated variables.

1 $C_{mon} \leftarrow$ MONOTONE-DNF($f_S^{mon}$, $\emptyset$)
2 $C \leftarrow \bigcup_{C_i \in C_{mon}} \big( C_i \cdot (\bigwedge_{i \in [n] \setminus S} \bar{x}_i) \big)$
3 for $j \in S$ do
4     if $f(S \setminus \{j\}) > f(S)$ then
5         $C \leftarrow C \cup$ DNF($f$, $S \setminus \{j\}$).
6 return $C$



Proposition 2.4. For all functions $f : 2^{[n]} \to \{0,\dots,k\}$, the collection of all monotone regions of sets in $\mathcal{S}$ forms a cover of the domain, namely, $\bigcup_{S \in \mathcal{S}} S^{\updownarrow} = 2^{[n]}$.

Lemma 2.1 and Proposition 2.2 give that the collection of terms $C_{mon}$, constructed in Line 1 of Algorithm 2, corresponds to a monotone pseudo-Boolean $k$-DNF representation for $f_S^{mon}$. By the same argument as in the proof of Lemma 2.1, $|S| \ge n - k$ for all $S \in \mathcal{S}$, since the maximum of $f$ is at most $k$. Therefore, Line 2 of Algorithm 2 adds at most $n - |S| \le k$ negated variables to every term of $C_{mon}$, resulting in terms with at most $k$ positive and at most $k$ negated variables.

It remains to prove that the constructed formula represents $f$. For a set $S$, let $C_S$ denote the collection of terms obtained on Line 2 of Algorithm 2. By construction, $C_S(Y) = f_S^{mon}(Y) \cdot 1_{S^{\downarrow}}(Y)$ for all $Y \in 2^{[n]}$. For every $Y \in 2^{[n]}$, the first part of Proposition 2.3 implies that $C_S(Y) = f_S^{mon}(Y) \cdot 1_{S^{\downarrow}}(Y) \le f(Y)$, yielding $\max_{S \in \mathcal{S}} C_S(Y) \le f(Y)$. On the other hand, by Proposition 2.4, for every $Y \in 2^{[n]}$ there exists a set $S \in \mathcal{S}$ such that $Y \in S^{\updownarrow}$. For such $S$, the second part of Proposition 2.3 implies that $C_S(Y) = f_S^{mon}(Y) \cdot 1_{S^{\downarrow}}(Y) = f(Y)$. Therefore, $f$ is equivalent to $\max_{S \in \mathcal{S}} C_S$. □
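The construction of Algorithm 2 can be exercised on a small non-monotone example. The sketch below is illustrative only and makes several assumptions of ours: elements are indexed from 0, $f_S^{mon}$ is computed by brute force over all $Z$ with $Y \subseteq Z \subseteq S$, a term is a triple (constant, positive variables, negated variables), and the test function is the cut function of a hypothetical path graph on three vertices.

```python
from itertools import combinations

def subsets(ground):
    """Yield all subsets of the given collection as frozensets."""
    ground = sorted(ground)
    for size in range(len(ground) + 1):
        for combo in combinations(ground, size):
            yield frozenset(combo)

def f_mon(f, S):
    """Return Y -> min of f(Z) over Y <= Z <= S (Definition 2.2, by brute force)."""
    return lambda Y: min(f(Y | extra) for extra in subsets(S - Y))

def monotone_dnf(g, ground, T=frozenset()):
    """Algorithm 1 on the ground set `ground`; terms are (constant, positives)."""
    terms = {(g(T), T)}
    for j in ground - T:
        if g(T | {j}) > g(T):
            terms |= monotone_dnf(g, ground, T | {j})
    return terms

def dnf(f, n, S=None):
    """Algorithm 2; terms are (constant, positives, negated)."""
    full = frozenset(range(n))
    S = full if S is None else S
    negated = full - S  # Line 2: negate every variable outside S
    terms = {(a, A, negated) for a, A in monotone_dnf(f_mon(f, S), S)}
    for j in S:
        if f(S - {j}) > f(S):
            terms |= dnf(f, n, S - {j})
    return terms

def evaluate(terms, Y):
    """Maximum constant over the terms satisfied by the set Y."""
    return max(a for a, A, B in terms if A <= Y and not (B & Y))

edges = [(0, 1), (1, 2)]  # hypothetical path graph 0 - 1 - 2

def cut(S):
    """Number of edges with exactly one endpoint in S."""
    return sum((u in S) != (v in S) for u, v in edges)

terms = dnf(cut, 3)
```

Here the range is $\{0, 1, 2\}$, so $k = 2$; the resulting terms have at most 2 positive and at most 2 negated variables, and the formula agrees with `cut` on all 8 subsets.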

3 Generalization of Håstad's switching lemma for pseudo-Boolean DNFs

The following definitions are stated for completeness and can be found in [O'D12, Man95].

Definition 3.1 (Decision tree). A decision tree $T$ is a representation of a function $f : \{0,1\}^n \to \mathbb{R}$. It consists of a rooted binary tree in which the internal nodes are labeled by coordinates $i \in [n]$, the outgoing edges of each internal node are labeled 0 and 1, and the leaves are labeled by real numbers. We insist that no coordinate $i \in [n]$ appears more than once on any root-to-leaf path.

Each input $x \in \{0,1\}^n$ corresponds to a computation path in the tree $T$ from the root to a leaf. When the computation path reaches an internal node labeled by a coordinate $i \in [n]$, we say that $T$ queries $x_i$. The computation path then follows the outgoing edge labeled by $x_i$. The output of $T$ (and hence $f$) on input $x$ is the label of the leaf reached by the computation path. We identify a tree with the function it computes.

The depth $s$ of a decision tree $T$ is the maximum length of any root-to-leaf path in $T$. For a function $f$, DT-depth($f$) is the minimum depth of a decision tree computing $f$.
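For intuition, DT-depth can be computed by exhaustive search on functions of a few variables: a constant function has depth 0, and otherwise one tries every variable at the root and recurses on both outcomes. The following sketch is ours (the name `dt_depth` and the 0-indexed tuple inputs are assumptions) and runs in exponential time.

```python
from itertools import product

def dt_depth(f, n, fixed=None):
    """Minimum decision-tree depth of f on {0,1}^n given the partial assignment `fixed`."""
    fixed = fixed or {}
    free = [i for i in range(n) if i not in fixed]
    values = set()
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, b in fixed.items():
            x[i] = b
        for i, b in zip(free, bits):
            x[i] = b
        values.add(f(tuple(x)))
    if len(values) == 1:
        return 0  # constant under this partial assignment: a single leaf suffices
    # Otherwise query the best variable and pay for the worse of its two subtrees.
    return 1 + min(
        max(dt_depth(f, n, {**fixed, i: b}) for b in (0, 1)) for i in free
    )
```

For instance, the conjunction of two variables has depth 2 and the parity of three variables has depth 3.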

Definition 3.2 (Random restriction). A restriction $\rho$ is a mapping of the input variables to $\{0, 1, *\}$. The function obtained from $f(x_1,\dots,x_n)$ by applying a restriction $\rho$ is denoted $f|_\rho$. The inputs of $f|_\rho$ are those $x_i$ for which $\rho(x_i) = *$ while all other variables are set according to $\rho$.
A variable $x_i$ is live with respect to a restriction $\rho$ if $\rho(x_i) = *$. The set of live variables with respect to $\rho$ is denoted live($\rho$). A random restriction $\rho$ with parameter $p \in (0,1)$ is obtained by setting each $x_i$, independently, to 0, 1 or $*$, so that $\Pr[\rho(x_i) = *] = p$ and $\Pr[\rho(x_i) = 1] = \Pr[\rho(x_i) = 0] = (1-p)/2$.
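The definition above translates directly into a sampling routine. The sketch below also shows one way to apply a restriction to a pseudo-Boolean DNF, dropping falsified terms and shortening the rest; the triple representation `(a, A, B)` of a term and all function names are assumptions made for illustration.

```python
import random

def random_restriction(n, p, rng):
    """Leave each variable live ('*') with probability p, else fix it to 0 or 1 uniformly."""
    rho = []
    for _ in range(n):
        if rng.random() < p:
            rho.append('*')
        else:
            rho.append(rng.choice((0, 1)))
    return rho

def restrict_terms(terms, rho):
    """Apply rho to terms (a, A, B): A holds positive, B negated variable indices."""
    out = []
    for a, A, B in terms:
        if any(rho[i] == 0 for i in A) or any(rho[j] == 1 for j in B):
            continue  # the term is falsified by rho
        # Keep only the literals on live variables; satisfied literals disappear.
        out.append((a, {i for i in A if rho[i] == '*'}, {j for j in B if rho[j] == '*'}))
    return out

# Hypothetical formula and a fixed restriction for illustration.
terms = [(2, {0, 1}, set()), (1, {2}, {0}), (3, {1}, {2})]
rho = ['*', 1, 0]
restricted = restrict_terms(terms, rho)
sample = random_restriction(5, 0.5, random.Random(0))
```

With this `rho`, the first term shrinks to the single literal $x_0$, the second term is falsified, and the third becomes an empty term with constant 3.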

We will prove the following generalization of the switching lemma [Has86, Bea94].

Lemma 3.1 (Switching lemma for pseudo-Boolean formulas). Let $f \in \mathrm{DNF}^{k,r}$ and $\rho$ be a random restriction with parameter $p$ (i.e., $\Pr[\rho(x_i) = *] = p$). Then

$$\Pr[\text{DT-depth}(f|_\rho) \ge s] < r \cdot (7pk)^s.$$
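As a quick illustration of how the bound is typically instantiated (the choice of $p$ and the symbol $\eta$ below are ours, not taken from the lemma), fixing $p = 1/(14k)$ makes the base of the exponent equal to $1/2$:

```latex
p = \frac{1}{14k}
\;\Longrightarrow\;
7pk = \frac{1}{2}
\;\Longrightarrow\;
\Pr[\text{DT-depth}(f|_\rho) \ge s] < r \cdot 2^{-s},
\qquad\text{so } s \ge \log_2(r/\eta) \text{ gives failure probability below } \eta .
```

The factor $r$ thus costs only an additive $\log_2 r$ in the depth compared to the Boolean case.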

Proof. We use the exposition of Razborov's proof of the switching lemma for Boolean functions, described in [Bea94], as the basis of our proof and highlight the modifications we made for non-Boolean functions.

Define $\mathcal{R}_n^{\ell}$ to be the set of all restrictions $\rho$ on a domain of $n$ variables that have exactly $\ell$ unset variables. Fix some function $f \in \mathrm{DNF}^{k,r}$, represented by a formula $F$, and assume that there is a total order on the terms of $F$ as well as on the indices of the variables. A restriction $\rho$ is applied to $F$ in order, so that $F|_\rho$ is a pseudo-Boolean DNF formula whose terms consist of those terms in $F$ that are not falsified by $\rho$, each shortened by removing any variables that are satisfied by $\rho$, and taken in the order of occurrences of the original terms on which they are based.

Definition 3.3 (Canonical labeled decision tree). The canonical labeled decision tree for $F$, denoted $T(F)$, is defined inductively as follows:

1. If $F$ is a constant function then $T(F)$ consists of a single leaf node labeled by the appropriate constant.

2. If the first term $C_1$ of $F$ is not empty then let $F'$ be the remainder of $F$ so that $F = \max(C_1, F')$. Let $K$ be the set of variables appearing in $C_1$. The tree $T(F)$ starts with a complete binary tree for $K$, which queries the variables in $K$ in the order of their indices. Each leaf $v_\sigma$ in the tree is associated with a restriction $\sigma$ which sets the variables of $K$ according to the path from the root to $v_\sigma$. For each $\sigma$, replace the leaf node $v_\sigma$ by the subtree $T(F|_\sigma)$. For the unique assignment $\sigma$ satisfying $C_1$, also label the corresponding node by $L_\sigma$ equal to the maximum of the labels assigned to the predecessors of this node in the tree and the integer constant associated with the term $C_1$.

Note that for Boolean DNF formulas the internal nodes in the canonical labeled decision tree are never labeled. In this case, the definition above is equivalent to that in [Bea94]. For pseudo-Boolean DNF formulas the label $L_\sigma$ of the internal node $\sigma$ represents that the value of the formula on the leaves in the subtree of $\sigma$ is at least $L_\sigma$.

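Using the terminology introduced above, we can state the switching lemma as follows.

Lemma 3.2. Let $F \in \mathrm{DNF}^{k,r}$, $s > 0$, $p < 1/7$ and $\ell = pn$. Then

$$\frac{|\{\rho \in \mathcal{R}_n^{\ell} : |T(F|_\rho)| \ge s\}|}{|\mathcal{R}_n^{\ell}|} < r (7pk)^s.$$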

Proof. Let stars($k$, $s$) be the set of all sequences $\beta = (\beta_1,\dots,\beta_t)$ such that for each $j \in [t]$, the coordinate $\beta_j \in \{*, -\}^k \setminus \{-\}^k$ and such that the total number of $*$'s in all the $\beta_j$ is $s$.

Let $S \subseteq \mathcal{R}_n^{\ell}$ be the set of restrictions $\rho$ such that $|T(F|_\rho)| \ge s$. We will define an injective mapping from $S$ to the cartesian product $\mathcal{R}_n^{\ell - s} \times \text{stars}(k, s) \times [2^s] \times [r]$.

Let $F = \max_i C_i$. Suppose that $\rho \in S$ and $\pi$ is the restriction associated with the lexicographically first path in $T(F|_\rho)$ of length at least $s$. Trim the last variables in $\pi$ along the path from the root so that $|\pi| = s$. Let $c$ be the maximum label of the node on $\pi$ (or zero, if none of the nodes on $\pi$ are labeled). Partition the set of terms of $F$ into two sets $F'$ and $F''$, where $F'$ contains all terms with constants $\ge c$ and $F''$ contains all terms with constants $< c$ (for Boolean formulas, $c = 0$ and $F = F'$). We will use the subformula $F'$ and $\pi$ to determine the image of $\rho$. The image of $\rho$ is defined by following the path $\pi$ in the canonical labeled decision tree for $F|_\rho$ and using the structure of the tree.

Let $C_{\nu_1}$ be the first term of $F'$ that is not set to 0 by $\rho$. Since $|\pi| > 0$, such a term must exist and will not be an empty term (otherwise, the value of $F|_\rho$ is fixed to be $\ge c$). Let $K$ be the set of variables in $C_{\nu_1}|_\rho$ and let $\sigma_1$ be the unique restriction of the variables in $K$ that satisfies $C_{\nu_1}|_\rho$. Let $\pi_1$ be the part of $\pi$ that sets the variables in $K$. We have two cases based on whether $\pi_1 = \pi$.

1. If $\pi_1 \ne \pi$ then by the construction of $\pi$, restriction $\pi_1$ sets all the variables in $K$. Note that the restriction $\rho\sigma_1$ satisfies the term $C_{\nu_1}$ but since $\pi_1 \ne \pi$ the restriction $\rho\pi_1$ does not satisfy term $C_{\nu_1}$.

2. If $\pi_1 = \pi$ then it is possible that $\pi$ does not set all of the variables in $K$. In this case we shorten $\sigma_1$ to the variables in $K$ that appear in $\pi_1$.

Define $\beta_1 \in \{*, -\}^k$ based on the fixed ordering of the variables in the term $C_{\nu_1}$ by letting the $j$th component of $\beta_1$ be $*$ if and only if the $j$th variable in $C_{\nu_1}$ is set by $\sigma_1$. Since $C_{\nu_1}|_\rho$ is not the empty term, $\beta_1$ has at least one $*$. From $C_{\nu_1}$ and $\beta_1$ we can reconstruct $\sigma_1$.

Now by the definition of $T(F|_\rho)$, the restriction $\pi \setminus \pi_1$ labels a path in the canonical labeled decision tree $T(F|_{\rho\pi_1})$. If $\pi_1 \ne \pi$, we repeat the argument above, replacing $\pi$ and $\rho$ with $\pi \setminus \pi_1$ and $\rho\pi_1$, respectively, and find a term $C_{\nu_2}$ which is the first term of $F'$ not set to 0 by $\rho\pi_1$. Based on this, we generate $\pi_2$, $\sigma_2$ and $\beta_2$, as before. We repeat this process until the round $t$ in which $\pi_1\pi_2 \cdots \pi_t = \pi$.

Let $\sigma = \sigma_1\sigma_2 \cdots \sigma_t$. We define $\delta \in \{0,1\}^s$ to be a vector that indicates for each variable set by $\pi$ whether it is set to the same value as $\sigma$ sets it. We define the image of $\rho$ in the injective mapping as a quadruple $(\rho\sigma_1 \dots \sigma_t, (\beta_1,\dots,\beta_t), \delta, c)$. Because $\rho\sigma \in \mathcal{R}_n^{\ell - s}$ and $(\beta_1,\dots,\beta_t) \in \text{stars}(k, s)$, the mapping is as described above.

It remains to show that the defined mapping is indeed injective. We will show how to invert it by reconstructing $\rho$ from its image. We use $c$ to construct $F'$ from $F$. The reconstruction procedure is iterative. In one stage of the reconstruction we recover $\pi_1 \dots \pi_{i-1}$, $\sigma_1 \dots \sigma_{i-1}$ and construct $\rho\pi_1 \dots \pi_{i-1}\sigma_i \dots \sigma_t$. Recall that for $i < t$ the restriction $\rho\pi_1 \dots \pi_{i-1}\sigma_i$ satisfies the term $C_{\nu_i}$, but does not satisfy terms $C_j$ for all $j < \nu_i$. This holds if we extend the restriction by appending $\sigma_{i+1} \dots \sigma_t$. Thus, we can recover $\nu_i$ as the index of the first term of $F'$ that is not falsified by $\rho\pi_1 \dots \pi_{i-1}\sigma_i \dots \sigma_t$ and the constant corresponding to this term is at least $c$.

Now, based on $C_{\nu_i}$ and $\beta_i$, we can determine $\sigma_i$. Since we know $\sigma_1 \dots \sigma_i$, using the vector $\delta$ we can determine $\pi_i$. We can now change $\rho\pi_1 \dots \pi_{i-1}\sigma_i \dots \sigma_t$ to $\rho\pi_1 \dots \pi_{i-1}\pi_i\sigma_{i+1} \dots \sigma_t$ using the knowledge of $\pi_i$ and $\sigma_i$. Finally, given all the values of the $\pi_i$ we reconstruct $\rho$ by removing the variables from $\pi_1 \dots \pi_t$ from the restriction.

The following computation completes the proof and is given in Appendix C for completeness. 
Claim 3.3 ([Bea94]). For $p < 1/7$ and $p = \ell/n$ it holds that:

$$\frac{|\mathcal{R}_n^{\ell - s}|}{|\mathcal{R}_n^{\ell}|} \cdot |\text{stars}(k, s)| \cdot 2^s < (7pk)^s.$$

□

□



4 Learning pseudo-Boolean DNFs 



In this section, we present our learning results for pseudo-Boolean $k$-DNF and prove Theorem 1.2.

Let $R_r$ denote the set of multiples of $2/(r-1)$ in the interval $[-1, 1]$, namely $R_r = \{-1, -1 + 2/(r-1), \dots, 1 - 2/(r-1), 1\}$. First, we apply a transformation of the range by mapping $\{0,\dots,r\}$ to $R_r$. Formally, in this section instead of functions $f : \{0,1\}^n \to \{0,\dots,r\}$ we consider functions $f' : \{-1,1\}^n \to [-1,1]$, such that $f'(x'_1,\dots,x'_n) = 2/(r-1) \cdot f(x_1,\dots,x_n) - 1$, where $x'_i = 2x_i - 1$. Note that a learning algorithm for the class of functions that can be represented by pseudo-Boolean DNF formulas of width $k$ with constants in the range $R_r$ implies Theorem 1.2. Thus, to simplify the presentation we will abuse notation and refer to this class as $\mathrm{DNF}^{k,r}$.

For a set $S \subseteq [n]$, let $\chi_S$ be the standard Fourier basis vector and let $\hat{f}(S)$ denote the corresponding Fourier coefficient of a function $f$.

Definition 4.1. A function $g$ $\epsilon$-approximates a function $f$ if $\mathbb{E}[(f - g)^2] \le \epsilon$. A function is $M$-sparse if it has at most $M$ non-zero Fourier coefficients. The Fourier degree of a function, denoted deg($f$), is the size of the largest set $S$ such that $\hat{f}(S) \ne 0$.
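The quantities in Definition 4.1 can be computed by brute force for functions on a few variables. The sketch below assumes the standard convention $\chi_S(x) = \prod_{i \in S} x_i$ and $\hat{f}(S) = \mathbb{E}_x[f(x)\chi_S(x)]$ over uniform $x \in \{-1,1\}^n$; the function names, the 0-indexed coordinates and the numerical tolerance are our own choices.

```python
from itertools import combinations, product

def fourier_coefficients(f, n):
    """Map each S (a sorted tuple of indices) to E_x[f(x) * prod_{i in S} x_i]."""
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for S in combinations(range(n), size):
            total = 0.0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += f(x) * chi
            coeffs[S] = total / len(points)
    return coeffs

def sparsity(coeffs, tol=1e-12):
    """Number of non-zero Fourier coefficients."""
    return sum(1 for c in coeffs.values() if abs(c) > tol)

def degree(coeffs, tol=1e-12):
    """Size of the largest S with a non-zero coefficient (0 if there is none)."""
    return max((len(S) for S, c in coeffs.items() if abs(c) > tol), default=0)

def f_max(x):
    """Hypothetical example: the maximum of two +/-1 variables."""
    return max(x[0], x[1])

coeffs = fourier_coefficients(f_max, 2)
```

For `f_max` the expansion is $\frac{1}{2} + \frac{1}{2}x_0 + \frac{1}{2}x_1 - \frac{1}{2}x_0x_1$, so the function is 4-sparse with Fourier degree 2, and the squared coefficients sum to $\mathbb{E}[f^2] = 1$ as Parseval's identity requires.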

The following guarantee about approximation of functions in $\mathrm{DNF}^{k,r}$ by sparse functions is the key lemma in the proof of Theorem 1.2.

Lemma 4.1. Every function $f \in \mathrm{DNF}^{k,r}$ can be $\epsilon$-approximated by an $M$-sparse function, where $M = k^{O(k \log (r/\epsilon))}$.

Proof of Lemma 4.1. We generalize the proof by Mansour [Man95], which relies on multiple applications of the switching lemma. Our generalization of the switching lemma allows us to obtain the following parameters of the key statements in the proof, which bound the $L_2$-norm of the Fourier coefficients of large sets and the $L_1$-norm of the Fourier coefficients of small sets.

Lemma 4.2. For every function $f \in \mathrm{DNF}^{k,r}$,

$$\sum_{S :\, |S| > 28k \log(2r/\epsilon)} \hat{f}^2(S) \le \epsilon/2.$$

Lemma 4.3. For every function $f \in \mathrm{DNF}^{k,r}$,

$$\sum_{S :\, |S| \le \tau} |\hat{f}(S)| \le 4r(28k)^{\tau} = r k^{O(\tau)}.$$

Lemmas 4.2 and 4.3 are proved in Appendix B. 

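Let $\tau = 28k \log(2r/\epsilon)$ and $L = \sum_{|S| \le \tau} |\hat{f}(S)|$. Let $G = \{S : |\hat{f}(S)| \ge \epsilon/2L \text{ and } |S| \le \tau\}$ and $g(x) = \sum_{S \in G} \hat{f}(S)\chi_S(x)$. We will show that $g$ is $M$-sparse and that it $\epsilon$-approximates $f$.

By an averaging argument, $|G| \le 2L^2/\epsilon$. Thus, the function $g$ is $M$-sparse, where $M \le 2L^2/\epsilon$. By Lemma 4.3, $L = r k^{O(\tau)} = k^{O(k \log(r/\epsilon))}$. Thus, $M = k^{O(k \log(r/\epsilon))}$, as claimed in the lemma statement.

By the definition of $g$ and by Parseval's identity,

$$\mathbb{E}[(f - g)^2] = \sum_{S \notin G} \hat{f}^2(S) = \sum_{S :\, |S| > \tau} \hat{f}^2(S) + \sum_{S :\, |S| \le \tau,\, |\hat{f}(S)| < \epsilon/2L} \hat{f}^2(S).$$

By Lemma 4.2, the first summation is at most $\epsilon/2$. For the second summation, we get:

$$\sum_{S :\, |S| \le \tau,\, |\hat{f}(S)| < \epsilon/2L} \hat{f}^2(S) \le \Big( \max_{S :\, |S| \le \tau,\, |\hat{f}(S)| < \epsilon/2L} |\hat{f}(S)| \Big) \Big( \sum_{S :\, |S| \le \tau} |\hat{f}(S)| \Big) \le \frac{\epsilon}{2L} \cdot L = \epsilon/2.$$

This implies that $\mathbb{E}[(f - g)^2] \le \epsilon$ and thus $g$ $\epsilon$-approximates $f$. □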
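
The construction of $g$ in the proof above can be traced on a toy example. The Python sketch below is illustrative only: the function `toy`, the choices `tau = 2` and `eps = 0.4`, and all helper names are assumptions made for the example, and the high-degree tail is simply computed rather than bounded via Lemma 4.2. It keeps the low-degree coefficients of magnitude at least $\epsilon/2L$ and evaluates the squared error through Parseval's identity.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficients(f, n):
    """Brute-force Fourier expansion of f on {-1, 1}^n (exponential in n)."""
    points = list(product([-1, 1], repeat=n))
    return {
        S: sum(f(x) * prod(x[i] for i in S) for x in points) / len(points)
        for size in range(n + 1)
        for S in combinations(range(n), size)
    }


def sparse_approximation(coeffs, tau, eps):
    """Return (G, L): the retained coefficients and the low-degree L1 mass."""
    L = sum(abs(c) for S, c in coeffs.items() if len(S) <= tau)
    if L == 0:
        return {}, L
    threshold = eps / (2 * L)
    G = {S: c for S, c in coeffs.items()
         if len(S) <= tau and abs(c) >= threshold}
    return G, L


def toy(x):
    """Hypothetical target with values in [-1, 1] on four variables."""
    return 0.5 * x[0] + 0.3 * x[1] * x[2] + 0.15 * x[3] + 0.05 * x[0] * x[1] * x[2] * x[3]


eps, tau = 0.4, 2
coeffs = fourier_coefficients(toy, 4)
G, L = sparse_approximation(coeffs, tau, eps)

# Parseval: E[(f - g)^2] is the sum of squares of the dropped coefficients.
error = sum(c * c for S, c in coeffs.items() if S not in G)
print(sorted(G), len(G) <= 2 * L * L / eps, error)
```

In this example $L \approx 0.95$, the threshold $\epsilon/2L$ is about $0.21$, the retained sets are $\{0\}$ and $\{1,2\}$, and the squared error is about $0.025$, well below $\epsilon$.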
To get a learning algorithm and prove Theorem 1.2, we can use the sparse approximation guarantee of Lemma 4.1 together with the Kushilevitz-Mansour learning algorithm (for PAC-learning) or the learning algorithm of Gopalan, Kalai and Klivans (for agnostic learning).

Proof of Theorem 1.2. We will use the learning algorithm of Kushilevitz and Mansour [GL89, KM91], 
which gives the following guarantee: 

Theorem 4.4 ([KM91]). Let $f$ be a function that can be $\epsilon$-approximated by an $M$-sparse function. There exists a randomized algorithm, whose running time is polynomial in $M$, $n$, $1/\epsilon$ and $\log(1/\delta)$, that given oracle access to $f$ and $\delta > 0$, with probability at least $1 - \delta$ outputs a function $h$ that $O(\epsilon)$-approximates $f$.

Setting $M = k^{O(k \log(r/\epsilon))}$ and the approximation parameter $\epsilon$ in Theorem 4.4 to be $\epsilon = \epsilon'/Cr^2$ for a large enough constant $C$, we get an algorithm which returns a function $h$ that $(\epsilon'/r^2)$-approximates $f$. The running time of such an algorithm is polynomial in $n$, $k^{O(k \log(r/\epsilon'))}$ and $\log(1/\delta)$. By Proposition 4.5, if we round the values of $h$ at every point to the nearest multiple of $2/(r-1)$, we get a function $h'$ such that $\Pr_{x \in U_n}[h'(x) \ne f(x)] \le \epsilon'$, completing the proof.

Proposition 4.5. Suppose a function $g : 2^{[n]} \to [-1,1]$ is an $\epsilon$-approximation for $f : 2^{[n]} \to R_r$. Let $h$ be the function defined by $h(x) = \arg\min_{y \in R_r} |g(x) - y|$. Then $\Pr_{x \in U_n}[h(x) \ne f(x)] \le \epsilon \cdot (r-1)^2$.

Proof of Proposition 4.5. Observe that $|f(x) - g(x)|^2 \ge 1/(r-1)^2$ whenever $f(x) \ne h(x)$. This implies

$$\Pr_{x \in U_n}[h(x) \ne f(x)] \le \Pr_{x \in U_n}\big[(r-1)^2 \cdot |f(x) - g(x)|^2 \ge 1\big] \le \mathbb{E}_{x \in U_n}\big[(r-1)^2 \cdot |f(x) - g(x)|^2\big]$$

$$= (r-1)^2 \cdot \mathbb{E}_{x \in U_n}\big[|f(x) - g(x)|^2\big] \le \epsilon (r-1)^2.$$

The last inequality follows from the definition of $\epsilon$-approximation. □
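
The rounding step of Proposition 4.5 can be sketched in a few lines of Python. The value tables `f_values` and `g_values` below are made-up data over eight domain points, and the helper names `grid` and `round_to_grid` are assumptions for illustration; the sketch rounds $g$ to the nearest point of $R_r$ and compares the disagreement rate with the bound $\epsilon (r-1)^2$.

```python
def grid(r):
    """R_r: the r evenly spaced points -1, -1 + 2/(r-1), ..., 1."""
    return [-1 + 2 * i / (r - 1) for i in range(r)]


def round_to_grid(value, r):
    """Nearest point of R_r to value (the argmin in Proposition 4.5)."""
    return min(grid(r), key=lambda y: abs(value - y))


r = 5  # R_5 = {-1, -0.5, 0, 0.5, 1}

# Hypothetical target values in R_5 and a noisy real-valued approximation.
f_values = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.0, 0.5, -1.0]
g_values = [-0.9, -0.45, 0.3, 0.5, 0.8, 0.05, 0.2, -1.0]

h_values = [round_to_grid(v, r) for v in g_values]

# eps is the squared L2 distance E[(f - g)^2] under the uniform distribution.
eps = sum((a - b) ** 2 for a, b in zip(f_values, g_values)) / len(f_values)
disagreement = sum(a != b for a, b in zip(f_values, h_values)) / len(f_values)

print(disagreement, eps * (r - 1) ** 2)
```

On this data the rounded function disagrees with $f$ on 2 of the 8 points, which is below the bound of roughly $0.47$ given by the proposition.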

Extension of our learning algorithm to the agnostic setting follows from the result of Gopalan, Kalai and 
Klivans. 

Theorem 4.6 ([GKK08]). If every function $f$ in a class $C$ has an $M$-sparse $\epsilon$-approximation, then there is an agnostic learning algorithm for $C$ with running time $\mathrm{poly}(n, M, 1/\epsilon)$.

□ 
A Converting a learner into a proper learner

Let C be a class of discrete objects represented by functions over a domain of "size" n. 

Proposition A.1. If there exists a learning algorithm $L$ for a class $C$ with query complexity $q(n, \epsilon)$ and running time $t(n, \epsilon)$, then there exists a proper learning algorithm $L'$ for $C$ with query complexity $q(n, \epsilon/2)$ and running time $t(n, \epsilon/2) + |C|$.

Proof. Given parameters $n$, $\epsilon$ and oracle access to a function $f$, the algorithm $L'$ first runs $L$ with parameters $n$, $\epsilon/2$ to obtain a hypothesis $g$. Then it finds and outputs a function $h \in C$ which is closest to $g$, namely $h = \arg\min_{h' \in C} \mathrm{dist}(g, h')$. By our assumption that $L$ is a learning algorithm, $\mathrm{dist}(f, g) \le \epsilon/2$. Since $f \in C$, we have $\mathrm{dist}(g, h) \le \mathrm{dist}(g, f) \le \epsilon/2$. By the triangle inequality, $\mathrm{dist}(f, h) \le \mathrm{dist}(f, g) + \mathrm{dist}(g, h) \le \epsilon$. □
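
The second stage of $L'$ is a nearest-neighbor search over the class. A minimal Python sketch follows, assuming functions are given as explicit truth tables and that dist is the fraction of disagreeing points; the toy class, the target `f` and the improper hypothesis `g` are invented for illustration.

```python
def dist(u, v):
    """Fraction of domain points on which two truth tables disagree."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def make_proper(hypothesis, concept_class):
    """Return a member of the class closest to the hypothesis."""
    return min(concept_class, key=lambda h: dist(hypothesis, h))


# Toy class over two Boolean variables; points are ordered 00, 01, 10, 11.
concept_class = [
    (0, 0, 0, 0),  # constant 0
    (1, 1, 1, 1),  # constant 1
    (0, 0, 1, 1),  # first variable
    (0, 1, 0, 1),  # second variable
]

f = (0, 0, 1, 1)  # target, a member of the class
g = (0, 1, 1, 1)  # improper hypothesis (OR), with dist(f, g) = 1/4

h = make_proper(g, concept_class)
print(h, dist(f, h))
```

Several class members are equally close to `g` here, so the output need not equal `f`; the triangle inequality still guarantees $\mathrm{dist}(f, h) \le 2 \cdot \mathrm{dist}(f, g)$. The exhaustive scan over the class corresponds to the additive $|C|$ term in the running time.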



B Fourier analysis 
B.1 Proof of Lemma B.4

Proof of Lemma B.4. Consider a random variable $\mathcal{L}$ supported on $2^{[n]}$, such that for each $x_i$, independently $\Pr[x_i \in \mathcal{L}] = p$. The random variable $\mathcal{L}$ is the set of live variables in a random restriction with parameter $p$. We can rewrite $L_{1,t}$ as:

$$L_{1,t}(f) = \sum_{S :\, |S| = t} |\hat{f}(S)| = \Big(\frac{1}{p}\Big)^{t} \, \mathbb{E}_{\mathcal{L}}\Big[ \sum_{S \subseteq \mathcal{L},\, |S| = t} |\hat{f}(S)| \Big], \qquad (2)$$

since every set $S$ of size $t$ satisfies $\Pr[S \subseteq \mathcal{L}] = p^t$.

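For an arbitrary choice of $\mathcal{L}$ and a subset $S \subseteq \mathcal{L}$ we have:

$$|\hat{f}(S)| = \big| \mathbb{E}_{x_1, \dots, x_n}[f(x_1, \dots, x_n)\chi_S(x_1, \dots, x_n)] \big|$$

$$\le \mathbb{E}_{x \notin \mathcal{L}} \big| \mathbb{E}_{x \in \mathcal{L}}[f(x_1, \dots, x_n)\chi_S(x_1, \dots, x_n)] \big|$$

$$= \mathbb{E}_{\rho}\big[ |\widehat{f|_\rho}(S)| \;\big|\; \mathrm{live}(\rho) = \mathcal{L} \big],$$

where the last line follows from the observation that averaging over $x_i \notin \mathcal{L}$ is the same as taking the expectation of a random restriction whose set of live variables is restricted to be $\mathcal{L}$. Because the absolute value of the coefficient of every set $S$ does not decrease in expectation, this implies that:

$$\sum_{S \subseteq \mathcal{L},\, |S| = t} |\hat{f}(S)| \le \mathbb{E}_{\rho}\Big[ \sum_{S \subseteq \mathcal{L},\, |S| = t} |\widehat{f|_\rho}(S)| \;\Big|\; \mathrm{live}(\rho) = \mathcal{L} \Big] = \mathbb{E}_{\rho}\big[ L_{1,t}(f|_\rho) \;\big|\; \mathrm{live}(\rho) = \mathcal{L} \big].$$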



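Using this together with (2) we conclude that:

$$L_{1,t}(f) = \Big(\frac{1}{p}\Big)^{t} \, \mathbb{E}_{\mathcal{L}}\Big[ \sum_{S \subseteq \mathcal{L},\, |S| = t} |\hat{f}(S)| \Big] \le \Big(\frac{1}{p}\Big)^{t} \, \mathbb{E}_{\rho}\big[ L_{1,t}(f|_\rho) \big].$$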



□ 



B.2 Proof of Lemma 4.2 

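Proof of Lemma 4.2.

Lemma B.1 ([Man95, O'D12]). Let $f : \{0,1\}^n \to \{-1,1\}$ and let $f|_\rho$ be a random restriction with parameter $p$. Then

$$\sum_{|S| > t} \hat{f}^2(S) \le \Pr[\deg(f|_\rho) \ge tp/2].$$

Because $\deg(f|_\rho) \le \text{DT-depth}(f|_\rho)$, we have $\Pr[\deg(f|_\rho) \ge tp/2] \le \Pr[\text{DT-depth}(f|_\rho) \ge tp/2]$. By using Lemma 3.1 and setting $p = 1/14k$ and $t = 28k \log(2r/\epsilon)$, we complete the proof: with these parameters $tp/2 = \log(2r/\epsilon)$ and $7pk = 1/2$, so, taking logarithms base 2, the bound $r(7pk)^{tp/2}$ of Lemma 3.1 equals $\epsilon/2$. □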

B.3 Proof of Lemma 4.3 

Proof of Lemma 4.3. Let $L_{1,t}(f) = \sum_{|S| = t} |\hat{f}(S)|$ and $L_1(f) = \sum_{t=0}^{n} L_{1,t}(f) = \sum_{S} |\hat{f}(S)|$. We use the following bound on $L_1(f)$ for decision trees.

Proposition B.2 ([KM93, O'D12]). Consider a function $f$ such that $\text{DT-depth}(f) \le s$. Then $L_1(f) \le 2^s$.

We show the following generalization of Lemma 5.2 in [Man95] for $\mathrm{DNF}^{k,r}$.

Lemma B.3. Let $f \in \mathrm{DNF}^{k,r}$ and let $\rho$ be a random restriction of $f$ with parameter $p \le 1/28k$. Then $\mathbb{E}_\rho[L_1(f|_\rho)] \le 2r$.

Proof of Lemma B.3. By the definition of expectation,

$$\mathbb{E}_\rho[L_1(f|_\rho)] = \sum_{s=0}^{n} \Pr[\text{DT-depth}(f|_\rho) = s] \cdot \mathbb{E}_\rho\big[ L_1(f|_\rho) \;\big|\; \text{DT-depth}(f|_\rho) = s \big].$$

By Proposition B.2, for all $\rho$ such that $\text{DT-depth}(f|_\rho) = s$, it holds that $L_1(f|_\rho) \le 2^s$. By Lemma 3.1 we have $\Pr[\text{DT-depth}(f|_\rho) \ge s] \le r(7pk)^s$. Therefore $\mathbb{E}_\rho[L_1(f|_\rho)] \le \sum_{s=0}^{n} r(7pk)^s 2^s = r \cdot \sum_{s=0}^{n} (14pk)^s$. For $p \le 1/28k$ the lemma follows. □
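
The last step of the proof of Lemma B.3 is a geometric series with ratio $14pk \le 1/2$, which sums to at most 2. The Python sketch below checks this with exact rational arithmetic; the function name `lemma_b3_series` and the sampled ranges of $r$, $k$ and $n$ are illustrative assumptions.

```python
from fractions import Fraction


def lemma_b3_series(r, k, p, n):
    """Evaluate r * sum_{s=0}^{n} (14 p k)^s, the final bound in the proof."""
    ratio = 14 * p * k
    return r * sum(ratio ** s for s in range(n + 1))


# Largest allowed restriction parameter p = 1/(28k), so the ratio is exactly 1/2.
bounds = {
    (r, k): lemma_b3_series(r, k, Fraction(1, 28 * k), n=40)
    for r in range(2, 5)
    for k in range(1, 6)
}

print(all(value <= 2 * r for (r, k), value in bounds.items()))
```

At $p = 1/28k$ the sum equals $r(2 - 2^{-n})$, so the bound $2r$ is approached but never exceeded; any smaller $p$ only decreases the ratio.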
We use Lemma 5.3 from [Man95] to bound $L_{1,t}(f)$ by the value of $\mathbb{E}_\rho[L_{1,t}(f|_\rho)]$. Because in [Man95] the lemma is stated for Boolean functions, we give the proof for real-valued functions in Appendix B.1 for completeness.

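Lemma B.4 ([Man95]). For $f : \{0,1\}^n \to [-1,1]$ and a random restriction $\rho$ with parameter $p$,

$$L_{1,t}(f) \le \Big(\frac{1}{p}\Big)^{t} \, \mathbb{E}_{\rho}\big[ L_{1,t}(f|_\rho) \big].$$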



C Omitted proofs from Section 2 and Section 3 
C.1 Proof of Proposition 2.4

Proof of Proposition 2.4. The proof is by induction on the value $f([n])$ that the function $f$ takes on the largest set in its domain. The base case of induction is $f([n]) = k$. In this case, $\mathcal{S}$ consists of a single set
S = [n], and the function / is monotone non-increasing on = S-^. Now suppose that the statement 
holds for all /, such that /([n]) > t. If f([n]) = t — 1 then for all Y, such that there exists a set Z of size 
n — 1 such that f(Z) > /([n]) and Y G there exists a set S € S, such that Y € by applying inductive 
hypothesis to f Z i- Otherwise, Y € [n]-"K completing the proof. □ 





C.2 Proof of Claim 3.3 



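Proof of Claim 3.3. We have $|\mathcal{R}_n^{\ell}| = \binom{n}{\ell} 2^{n-\ell}$, so:

$$\frac{|\mathcal{R}_n^{\ell-s}|}{|\mathcal{R}_n^{\ell}|} \le \frac{(2\ell)^s}{(n-\ell)^s}.$$

We use the following bound on $|\mathrm{stars}(k,s)|$.

Proposition C.1 (Lemma 2 in [Bea94]). $|\mathrm{stars}(k,s)| \le (k/\ln 2)^s$.

Using Proposition C.1 we get:

$$\frac{|S|}{|\mathcal{R}_n^{\ell}|} \le \frac{|\mathcal{R}_n^{\ell-s}|}{|\mathcal{R}_n^{\ell}|} \cdot |\mathrm{stars}(k,s)| \cdot 2^s \le \left( \frac{4pk}{(1-p)\ln 2} \right)^{s}.$$

For $p < 1/7$, the last expression is at most $(7pk)^s$, as claimed.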



□ 
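
The two numerical facts used in this computation can be checked directly. The Python sketch below assumes, as in the proof above, that $\mathcal{R}_n^{\ell}$ is the set of restrictions of $n$ variables with exactly $\ell$ unset variables; the function names and the sample values $n = 70$, $\ell = 10$ (so $p = 1/7$, the boundary case) are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, log


def num_restrictions(n, ell):
    """Count restrictions with ell unset variables: C(n, ell) * 2^(n - ell)."""
    return comb(n, ell) * 2 ** (n - ell)


def restriction_ratio(n, ell, s):
    """Exact value of |R_n^(ell - s)| / |R_n^ell|."""
    return Fraction(num_restrictions(n, ell - s), num_restrictions(n, ell))


n, ell = 70, 10  # p = ell / n = 1/7

# First inequality: the ratio is at most (2 ell / (n - ell))^s for every s.
ratio_ok = all(
    restriction_ratio(n, ell, s) <= Fraction(2 * ell, n - ell) ** s
    for s in range(1, ell + 1)
)

# Last step: 4 / ((1 - p) ln 2) stays below 7 when p is at most 1/7.
p = ell / n
constant = 4 / ((1 - p) * log(2))

print(ratio_ok, constant)
```

The constant evaluates to roughly $6.73$ at $p = 1/7$ and decreases for smaller $p$, which is why the bound can be stated as $(7pk)^s$.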